## Loaded packages
## [1] "caret" "lattice" "crosstalk" "corrplot" "corrr"
## [6] "openxlsx" "plotly" "ggplot2" "formattable" "tidyr"
## [11] "dplyr" "stats" "graphics" "grDevices" "utils"
## [16] "datasets" "methods" "base"
The provided data is organized so that each patient has several rows. Each row describes a single point in time at which a certain group of parameters was measured. Because of this approach the data contains many NA values, both row-wise and column-wise.
| Rows in the dataset | Columns in the dataset | Decisive attributes | First admission | Last discharge |
|---|---|---|---|---|
| 6120 | 84 | 78 | 2020-01-10 15:52:20 | 2020-03-04 16:21:51 |
| Gender | Number of cases |
|---|---|
| Male | 224 |
| Female | 151 |
To create a correlation matrix, all measurements of every patient have to be aggregated into a single row, so an aggregation method must be chosen for columns containing more than one value. Three data frames are created, each using a different aggregation method: mean, max and last. The “last” method takes only the most recent measurement into consideration. Each of these data frames is then used to create a correlation data frame with the corrr package, which makes it possible to skip the step of creating a correlation matrix and converting it into a data frame. In the following blocks and explanations, each set of correlations is referred to by the name of its aggregation method.
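The aggregation step can be illustrated with a minimal base-R sketch. It is not the report's own code: the toy values are made up, `PATIENT_ID` is an assumed identifier column, and the rows are assumed to be sorted by measurement time within each patient (which the “last” method depends on).

```r
# Hypothetical long-format data: several measurement rows per patient, with NAs.
measurements <- data.frame(
  PATIENT_ID = c(1, 1, 1, 2, 2, 3, 3),
  hsCRP      = c(10, NA, 4, 80, 120, NA, 6),
  albumin    = c(38, 40, NA, 30, 27, 41, NA),
  outcome    = c(0, 0, 0, 1, 1, 0, 0)
)

# Aggregators that skip NA and return NA only if a patient has no value at all.
agg_mean <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
agg_max  <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
agg_last <- function(x) {
  x <- x[!is.na(x)]                      # keep observed values only
  if (length(x) == 0) NA_real_ else x[length(x)]
}

# Collapse all rows of a patient into one row using the given aggregator.
aggregate_patients <- function(df, fun) {
  aggregate(df[, c("hsCRP", "albumin", "outcome")],
            by = list(PATIENT_ID = df$PATIENT_ID), FUN = fun)
}

by_mean <- aggregate_patients(measurements, agg_mean)
by_max  <- aggregate_patients(measurements, agg_max)
by_last <- aggregate_patients(measurements, agg_last)

# One correlation matrix per aggregation method (identifier column dropped).
cor_last <- cor(by_last[, -1], use = "pairwise.complete.obs")
print(by_last)
```

Using NA-skipping aggregators matters here, because most rows hold values for only one group of parameters.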
The corrr library makes it possible to select a specific attribute for the analysis to “focus” on, which filters out all correlations that do not involve the selected attribute. In this study we want to determine which attributes are associated with which outcome of the disease, so the focused attribute is “outcome”. The results are shown below in the form of bar plots. To keep the plots readable, only correlations higher than 0.6 or lower than -0.6 are shown. Hovering over a bar shows the precise value of the correlation.
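In corrr this kind of filtering is typically done by piping the result of `correlate()` into `focus()`. The base-R sketch below shows the equivalent steps on made-up data; the column names `marker_up`, `marker_down` and `noise` are hypothetical and only serve to illustrate the 0.6 threshold.

```r
# Hypothetical one-row-per-patient data (values invented for illustration).
patients <- data.frame(
  outcome     = c(0, 0, 0, 0, 1, 1, 1, 1),
  marker_up   = c(1, 2, 1, 2, 9, 8, 9, 8),  # rises with the outcome
  marker_down = c(9, 8, 9, 8, 1, 2, 1, 2),  # falls with the outcome
  noise       = c(1, 2, 3, 4, 1, 2, 3, 4)   # unrelated to the outcome
)

cor_matrix <- cor(patients, use = "pairwise.complete.obs")

# "Focus" on the outcome: keep only its correlations with the other columns.
outcome_cor <- cor_matrix[rownames(cor_matrix) != "outcome", "outcome"]

# Keep only strong correlations, as in the bar plots (|r| > 0.6).
strong <- sort(outcome_cor[abs(outcome_cor) > 0.6], decreasing = TRUE)
print(round(strong, 3))
```

The threshold is applied to the absolute value so that strong negative correlations are kept as well.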
The correlation plots show that, regardless of the aggregation method, the same group of attributes is most strongly correlated with the outcome. There are some differences, but overall the same attributes appear in all three plots. Because of that, the following analysis focuses mostly on neutrophils (percentage), fibrin degradation products (D-dimer is not included, since it is a subtype of these), lactate dehydrogenase, high-sensitivity C-reactive protein, calcium, prothrombin activity, albumin and lymphocyte percentage.
This section presents several interactive plots. For visualization purposes the timestamp of each measurement was normalized: it was replaced by the difference between the actual measurement time and the time of the first measurement that the given patient had. As a result, the Normalized_time variable contains the number of hours that had passed since the patient’s first examination. This approach makes it possible to visualize and compare the course of a given attribute across numerous patients on a single plot.
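A minimal base-R sketch of this normalization follows. The `Normalized_time` name comes from the text above, while `PATIENT_ID`, `RE_DATE` and the timestamps themselves are assumptions made for the example.

```r
# Hypothetical measurement timestamps for two patients.
measurements <- data.frame(
  PATIENT_ID = c(1, 1, 1, 2, 2),
  RE_DATE = as.POSIXct(
    c("2020-01-31 08:00:00", "2020-01-31 20:00:00", "2020-02-02 08:00:00",
      "2020-02-10 12:00:00", "2020-02-10 18:30:00"),
    tz = "UTC"
  )
)

# Seconds since the epoch, so that plain arithmetic can be used.
seconds <- as.numeric(measurements$RE_DATE)

# Earliest measurement of each patient, repeated on every row of that patient.
first_seconds <- ave(seconds, measurements$PATIENT_ID, FUN = min)

# Hours elapsed since the patient's first measurement.
measurements$Normalized_time <- (seconds - first_seconds) / 3600
print(measurements)
```

Every patient starts at hour 0, so curves from patients admitted on different dates can share one time axis.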
This plot shows extremely chaotic data for deceased patients. There is practically no trend, and little more to say about this data except that the hsCRP levels are quite high compared to those of the patients who lived. If we select only the Alive patients, we can see that in almost every case hsCRP decreased over time. hsCRP is a blood test that measures the level of inflammation in the body; it is used, for example, to assess the risk of heart disease or stroke. A high hsCRP value means strong inflammation, which is consistent with the observation that people infected with COVID-19 who had high hsCRP died.
Fibrin degradation products (FDP) are blood components produced by clot degeneration. The FDP value is high after any thrombotic event. The chaotic data on the plot might indicate that the patients with high FDP (all of whom later died) suffered from some kind of blood dysfunction.
Lactate dehydrogenase is an enzyme that is present in almost every living cell. High levels (up to 4 times higher in deceased patients than in survivors) can indicate the early stage of a heart attack and are, in general, a negative prognostic factor.
Lower levels of calcium among deceased patients can indicate numerous things. Hypocalcemia, however, can lead to several muscle-related problems, such as tetany or even disrupted conduction in the cardiac tissue. The effect of low calcium levels has been researched and can be read about in this article.
Prothrombin is a coagulation factor, which means that its role is to manage the clotting process. Low levels of prothrombin activity are related to fibrin degradation products. The low levels of prothrombin activity that occurred among deceased patients can indicate problems with the clotting process.
Albumin is the main protein in human blood, making up about 60% of all the proteins. Its main role is to maintain proper oncotic pressure, which prevents water containing electrolytes from leaking out of the blood vessels into tissues. A healthy person should have an albumin level ranging from 30 to 55 mg/ml of blood.
Lymphocytes are, next to neutrophils, one of five kinds of white blood cells. Low levels of lymphocytes can indicate autoimmune diseases, AIDS or other infectious diseases.
The dataset for the classification problem cannot contain NA values if Random Forest is used as the training method. Because of that, only a few columns were chosen for the classification problem:

* Lymphocyte percentage
* Neutrophils percentage
* High-sensitivity C-reactive protein
* Lactate dehydrogenase
* Albumin

These are the attributes that showed the highest correlation with the outcome, as shown in the “Determining the correlation” section.
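The preparation described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data is invented, only two of the five predictors are included, and the 70/30 stratified split is an assumption (the report does not state how the split was made; with caret a common choice is `createDataPartition()`).

```r
# Hypothetical one-row-per-patient data with a few missing values.
patients <- data.frame(
  outcome = factor(rep(c("Alive", "Dead"), each = 10)),
  Lactate_dehydrogenase = c(seq(180, 270, by = 10), seq(500, 950, by = 50)),
  hsCRP = c(NA, 2:10, 60, NA, seq(70, 140, by = 10))
)

# Random Forest cannot handle NA, so keep only fully observed rows.
complete <- patients[complete.cases(patients), ]

# Stratified split: sample about 70% of the row positions within each class.
set.seed(42)
positions_by_class <- split(seq_len(nrow(complete)), complete$outcome)
train_idx <- unlist(
  lapply(positions_by_class,
         function(idx) sample(idx, size = ceiling(0.7 * length(idx)))),
  use.names = FALSE
)

train_set <- complete[train_idx, ]
test_set  <- complete[-train_idx, ]

cat("Size of the training set:", nrow(train_set), "\n")
cat("Size of the testing set:", nrow(test_set), "\n")
```

Sampling within each class keeps the Alive/Dead ratio similar in both sets, which matters when the test set is small.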
## Size of the training set: 247
## Size of the testing set: 104
## Random Forest
##
## 247 samples
## 5 predictor
## 2 classes: 'Alive', 'Dead'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 124, 123, 124, 123, 123, 124, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9684173 0.9363620
## 3 0.9651915 0.9297891
## 5 0.9538618 0.9068729
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
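The “Summary of sample sizes: 124, 123, ...” line follows from the resampling scheme: with 2 folds, each model is fitted on one half of the 247 training samples and evaluated on the other half, and the whole procedure is repeated 5 times. The base-R sketch below reproduces only these sizes. It uses plain random folds, whereas caret’s own fold creation also takes the class balance into account, so it is not a reimplementation of caret’s resampling.

```r
n_samples <- 247  # size of the training set reported above
n_folds   <- 2
n_repeats <- 5

set.seed(7)
train_sizes <- unlist(lapply(seq_len(n_repeats), function(r) {
  # Randomly assign every sample to one of the folds (sizes 124 and 123).
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_samples))
  # For each held-out fold, the model is trained on all remaining samples.
  sapply(seq_len(n_folds), function(k) sum(fold_id != k))
}))

print(train_sizes)
```

With only two folds each model sees about half of the data, so the cross-validated accuracy is a rather conservative estimate compared with, for example, 10-fold cross-validation.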
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 54 1
## Dead 3 46
##
## Accuracy : 0.9615
## 95% CI : (0.9044, 0.9894)
## No Information Rate : 0.5481
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9226
##
## Mcnemar's Test P-Value : 0.6171
##
## Precision : 0.9818
## Recall : 0.9474
## F1 : 0.9643
## Prevalence : 0.5481
## Detection Rate : 0.5192
## Detection Prevalence : 0.5288
## Balanced Accuracy : 0.9630
##
## 'Positive' Class : Alive
##
## Random Forest
##
## 247 samples
## 5 predictor
## 2 classes: 'Alive', 'Dead'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 124, 123, 124, 123, 123, 124, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.9877812 0.9630597 0.9517857
## 2 0.9883376 0.9674495 0.9732143
## 3 0.9906038 0.9645083 0.9732143
## 4 0.9877185 0.9630597 0.9571429
## 5 0.9868392 0.9615672 0.9553571
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive Dead
## Alive 55 1
## Dead 2 46
##
## Accuracy : 0.9712
## 95% CI : (0.918, 0.994)
## No Information Rate : 0.5481
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9419
##
## Mcnemar's Test P-Value : 1
##
## Precision : 0.9821
## Recall : 0.9649
## F1 : 0.9735
## Prevalence : 0.5481
## Detection Rate : 0.5288
## Detection Prevalence : 0.5385
## Balanced Accuracy : 0.9718
##
## 'Positive' Class : Alive
##
Accuracy is about 1 percentage point better than before parameter tuning, the Kappa value is about 0.02 higher, and the values of the remaining measures are the same or higher than before. Because of the very high accuracy of the Random Forest method, no further methods were tested.
High precision and high recall together mean that the classifier performs well, since it does not return many false positives or false negatives. Not detecting ill people can, however, be quite problematic, since it could increase the strain on the medical system even more.
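The reported measures can be recomputed directly from the two confusion matrices, with “Alive” as the positive class. The helper `confusion_metrics()` below is a hypothetical name introduced for this sketch, not a caret function.

```r
# tp: predicted Alive, actually Alive   fp: predicted Alive, actually Dead
# fn: predicted Dead,  actually Alive   tn: predicted Dead,  actually Dead
confusion_metrics <- function(tp, fp, fn, tn) {
  n         <- tp + fp + fn + tn
  accuracy  <- (tp + tn) / n
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  # Cohen's kappa: agreement beyond what is expected by chance.
  expected  <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
  kappa     <- (accuracy - expected) / (1 - expected)
  c(Accuracy = accuracy, Precision = precision, Recall = recall,
    F1 = f1, Kappa = kappa)
}

before_tuning <- confusion_metrics(tp = 54, fp = 1, fn = 3, tn = 46)
after_tuning  <- confusion_metrics(tp = 55, fp = 1, fn = 2, tn = 46)

print(round(rbind(before_tuning, after_tuning), 4))
```

The two models differ by a single test patient: tuning turns one false negative into a true positive, which is what raises accuracy, recall and Kappa.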
## rf variable importance
##
## Overall
## Lactate_dehydrogenase 74.610
## hsCRP 27.151
## neutrophils_percent 15.245
## lymphocyte_percent 3.140
## albumin 2.119
The trained model shows that lactate dehydrogenase levels have the largest impact on predicting whether a patient will die or not. High-sensitivity C-reactive protein is less than half as important, and the neutrophils percentage comes third. This outcome is confirmed by the article from which the dataset was downloaded.